4 Main Analysis (Exploratory Data Analysis)
In our report, we take a macro-to-micro approach. We start with a brief overview of the whole league, then narrow down to compare team-specific performances. Originally, we chose all 31 teams and tried to analyze them together in our presentation. We then realised that plotting 31 teams together makes the plots impossible to interpret. We therefore decided to analyze the top 2 and bottom 2 teams in each region. Finally, we zoom in to the analysis of individual players. Occasionally, we break this structure to provide a better visual comparison between different, seemingly unrelated perspectives and to explain how they are related.
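The top/bottom selection described above can be sketched as a small helper. This is only an illustration: the team lists are copied from the code used later in this section, while the helper name `label_rank` and the toy data frame are assumptions made for the example.

```r
# Team lists copied from the code used later in this section
top4  <- c("Celtics", "Cavaliers", "Warriors", "Spurs")
down4 <- c("Lakers", "Suns", "76ers", "Nets")

# Hypothetical helper: keep only the eight selected teams and label each row
label_rank <- function(df, team_col = "Team_Name") {
  keep <- df[df[[team_col]] %in% c(top4, down4), , drop = FALSE]
  keep$Rank <- ifelse(keep[[team_col]] %in% top4, "Top4", "Down4")
  keep
}

# Toy data, made up for illustration only
toy <- data.frame(Team_Name = c("Spurs", "Nets", "Bulls"),
                  overall   = c(1, 2, 3))
selected <- label_rank(toy)
print(selected)
```

Keeping the two lists in one place avoids retyping the team names in every chunk, which is where a mismatch between plots could otherwise creep in.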
4.1 Overview of the whole league
4.1.1 Total number of games played vs number of wins
# Packages assumed by the code in this section
library(tidyr)      # gather()
library(dplyr)      # %>%
library(ggplot2)
library(artyfarty)  # pal(), theme_scientific()

# Number of games played (GP) vs number of wins (W), one row per team and type
df1 = clutch[,c('GP','W','team')]
df1 = gather(df1,type,count,-team)
# Order the teams by number of games played
temp = df1[df1$type=='GP',]
new_levels = as.character(temp[order(temp$count),]$team)
df1$team = factor(df1$team,levels=new_levels)
df1 %>% ggplot(aes(x=team, y=count, fill=type))+
  geom_bar(stat="identity",position="identity")+
  scale_fill_manual(name="type of games",values = pal("five38"))+
  coord_flip()+ggtitle("Number of games played (GP) vs number of wins (W)")+
  geom_hline(yintercept=0)+
  # After coord_flip(), y (count) is drawn horizontally and x (team) vertically
  ylab("number of games")+
  xlab("team name")+
  scale_y_continuous(breaks = pretty(df1$count),labels = abs(pretty(df1$count)))+
  theme_scientific()
In our data, the number of games played (GP) is not the same for every team. It is therefore pointless to compare the performance of teams solely based on the number of wins without considering the total number of games played. From this simple plot, we can observe that WAS played the largest number of games while GSW played the smallest. Interestingly, these two teams suffice to illustrate the point about the absolute number of wins. WAS has the largest number of wins in the league. However, the highest winning rate comes from GSW despite its relatively small number of wins.
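The difference between absolute wins and winning rate can be made concrete with a minimal sketch. The team labels and numbers below are made up for illustration and are not the values in `clutch`; only the column names `GP` and `W` follow the code above.

```r
# Toy numbers, made up for illustration; NOT the values in `clutch`
toy <- data.frame(team = c("AAA", "BBB", "CCC"),
                  GP   = c(50, 30, 40),
                  W    = c(30, 24, 20))

# Winning rate = wins divided by games played
toy$win_rate <- toy$W / toy$GP

# Rank teams by winning rate, highest first
ranked <- toy[order(-toy$win_rate), ]
print(ranked)
```

In this toy example the team with the most wins is not the team with the best winning rate, which is the same situation we describe for WAS and GSW.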
4.1.2 Personal fouls (PF) and turnovers (TOV)
df1 = clutch[,c('PF','TOV','team')]
df1= gather(df1,type,count,-team)
df1$count <- ifelse(df1$type =="PF",df1$count*(-1),df1$count)
temp = df1[df1$type=='TOV',]
new_levels= as.character(temp[order(temp$count),]$team)
df1$team = factor(df1$team,levels=new_levels)
#df1 <- within(df1, team <- factor(team, levels=names(sort(count, decreasing=TRUE))))
df1 %>% ggplot(aes(x=team, y=count, fill=type))+
geom_bar(stat="identity",position="identity")+
scale_fill_manual(values = pal("five38"))+
coord_flip()+ggtitle("Personal fouls (PF) and turnovers (TOV)")+
geom_hline(yintercept=0)+
ylab("counts")+
xlab("team name")+
scale_y_continuous(breaks = pretty(df1$count),labels = abs(pretty(df1$count)))+
theme_scientific()
We have seen this graph above when discussing the rounding pattern. We bring it up again because of its relationship with the following plot on aggressiveness and defensiveness. From this plot, we can see that the teams in the upper part of the graph have higher TOV and higher PF on average. There is a slight positive correlation between the two statistics.
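The visual impression of a slight positive correlation could be checked numerically. The sketch below is an assumption about how one might do it: the helper name `pf_tov_cor` and the toy values are hypothetical, and on the real data it should be run on the original `clutch` columns, before PF is multiplied by -1 for plotting.

```r
# Hypothetical helper: Pearson correlation between two columns of a data frame
pf_tov_cor <- function(df, x = "PF", y = "TOV") {
  cor(df[[x]], df[[y]], use = "complete.obs")
}

# Toy values, made up for illustration only
toy <- data.frame(PF  = c(10, 12, 15, 11, 18),
                  TOV = c(7, 9, 8, 6, 12))
r <- pf_tov_cor(toy)
print(r)
```

A value close to 0 would mean the pattern seen in the plot is weak, while a value close to 1 would indicate a strong positive relationship.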
4.1.3 Divergent plot on points decomposition
df1 = clutch[,c('PCT_PTS_2PT','PCT_PTS_3PT','PCT_PTS_FT','team')]
df1= gather(df1,type,count,-team)
temp = df1[df1$type=='PCT_PTS_2PT',]
new_levels= as.character(temp[order(temp$count),]$team)
df1$team = factor(df1$team,levels=new_levels)
df1$count <- ifelse(df1$type =="PCT_PTS_2PT",df1$count*(-1),df1$count)
df1 %>% ggplot(aes(x=team, y=count, fill=type))+
geom_col()+
scale_fill_manual(values = pal("five38"))+
coord_flip()+ggtitle("2PT%,3PT%,FT%")+
geom_hline(yintercept=0)+
ylab("percentage")+
xlab("team name")+
scale_y_continuous(breaks = pretty(df1$count),labels = abs(pretty(df1$count)))+
theme_scientific()
This plot gives us a very direct visual presentation of how different teams' points are decomposed. There is a fairly obvious negative relationship between 2PT and FT. A team like TOR has the highest percentage of 2PT and a very low percentage of 3PT, whereas CLE is the complete opposite. We mention these two teams because they illustrate well the interaction with the following plot on team aggressiveness and defensiveness.
4.1.4 Scatterplot on aggressiveness and defensiveness
library(png)
library(ggplot2)
library(gridGraphics)
library(ggimage)
path = 'https://github.com/NiHaozheng/NBA-Visualization/blob/master/clutch_team/logo/'
#img <- "https://github.com/NiHaozheng/NBA-Visualization/blob/master/clutch_team/logo/ATL.png?raw=true"
df1 = clutch[,c('OFF_RATING','DEF_RATING','team')]
df1$img = paste(path,df1$team,'.png?raw=true',sep='')
ggplot(df1,aes(x=OFF_RATING,y=DEF_RATING))+geom_point()+
scale_y_reverse()+geom_image(image = df1$img, size = .05)+
theme_scientific()+
xlab('offensive rating')+ylab('defensive rating')
In this part, we analyze how this scatter plot interacts with the previous three plots.
The scatter plot shows how offensive or defensive each team is. We can observe that MIL is a very defensive team with a very low offensive rating. BOS is a very offensive team with the highest offensive rating.
Teams like OKC, SAS and WAS have high ratings on both scales. This indicates strong performance in both defence and offence, which is the quality of a strong team. This is further supported by our plot on total games played and number of wins. WAS has the highest absolute number of wins. OKC and SAS have winning rates among the top 5.
We would expect an aggressive team to have a higher number of personal fouls. However, comparing the plot on personal fouls with the offensive rating, there does not seem to be a direct relationship between them.
Interaction between the scoring decomposition plot and the agg-def plot. Does how aggressive or defensive a team is affect the way it scores? The answer is easy to see by comparing the two plots. As mentioned before, TOR has the highest percentage of 2PT and a very low percentage of 3PT, whereas CLE is the complete opposite. We can observe that TOR has a very high defensive rating while CLE has a very high offensive rating. One potential explanation is that 3-pointers are viewed as a much riskier and more offensive scoring method than the much safer 2-pointers.
4.1.5 Traditional measure on TSP VS PTS
# Define FGA: Field Goal Attempt
FGA = df_fgm$overall / df_fct$overall
# Define TSP: True shooting percent
TSP = df_pts$overall/(2*(FGA+0.44*df_fta$overall))
df_pts['TSP'] = TSP
# Make a copy of df_pts
df_pts_v1 = df_pts
# Subset to remove all the NAs due to players that did not have a team or did not play in 2016
df_pts_v1_2 = df_pts_v1[!is.na(df_pts_v1$TSP),]
##==================================================================
#Plot on whole data, all teams
p_TSP = ggplot(df_pts_v1_2)+
geom_point(aes(overall,TSP,color = player_name),size = 1)+
facet_wrap(~Team_Name)+
labs(title = "TSP V.S PTS Facet on Team",x = 'Overall PTS', y='Overall TSP')
ggplotly(p_TSP)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
TopLowTeam = c("Celtics","Cavaliers","Warriors","Spurs","Lakers","Suns","76ers","Nets")
TopLowP_TSP = df_pts_v1_2[df_pts_v1_2$Team_Name %in% TopLowTeam,]
TopLowP_TSP['Rank'] = ifelse(TopLowP_TSP$Team_Name %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
p_TSP = ggplot(TopLowP_TSP)+
geom_point(aes(X5min_plusminus_5,TSP,color = player_name,shape=Rank),size = 2)+
facet_wrap(~Team_Name)+
labs(title = "TSP V.S PTS Facet on X5min_plusminus_5 Top4Last4",x = 'X5min_plusminus_5 PTS', y='X5min_plusminus_5 TSP')
ggplotly(p_TSP)
From a micro level, we can observe from the graph that in the top 4 teams, the best players take over the game in clutch time: LeBron James for the Cavaliers, Kyrie Irving for the Celtics, Kawhi Leonard for the Spurs, and Stephen Curry and Kevin Durant for the Warriors. The reason may be that coaches usually trust their best players, who then take most of the shots. It is worth noting, however, that some other good players on those teams perform exceptionally well in clutch time, for example Kyle Korver for the Cavaliers and Danny Green for the Spurs. Perhaps they should take more of the shots.
From a macro level, we can see that strong teams like the Celtics and Spurs have a very high true shooting percentage (TSP). This is the traditional measure of the performance of a team. Moreover, as analyzed before, the Spurs have a high 3pt ratio. That their TSP remains so high reflects the quality of the team members and contributes to the good performance of the team.
Hence, in this part, we illustrated that we should not look at the traditional data or our data alone. We should integrate them. The Spurs' true shooting percentage is good on its own. Coupled with their high share of 3pt attempts and their aggressive style, it becomes even more valuable.
Moreover, if we look at TSP alone, the 76ers appear to have a pretty decent performance. However, once we cross-reference it with their defensive strategy and high 2pt ratio, this figure may not be as convincing. This is one example of how we can integrate the traditional data and the alternative data.
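For readers unfamiliar with TSP, the computation used in the code of this subsection can be sketched as follows. The function name `true_shooting` and the toy numbers are assumptions made for illustration; the formula itself, including recovering attempts as made shots divided by shooting percentage, mirrors the code above.

```r
# True shooting percentage: PTS / (2 * (FGA + 0.44 * FTA))
true_shooting <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Toy player, made up for illustration only
fgm    <- 8      # field goals made
fg_pct <- 0.5    # field goal percentage as a fraction
fga    <- fgm / fg_pct   # attempts recovered from made shots and percentage
tsp    <- true_shooting(pts = 22, fga = fga, fta = 5)
print(tsp)

# A player with no made shots and a percentage of 0 yields 0/0 = NaN,
# which is.na() treats as missing; such rows are dropped before plotting
tsp_empty <- true_shooting(pts = 0, fga = 0 / 0, fta = 0)
print(is.na(tsp_empty))
```

The second part shows one way missing TSP values can arise, which is why the rows are filtered with `!is.na(...)` before plotting.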
4.2 Team specific analysis
In this section, we zoom in on the top 2 and bottom 2 teams in both the East and West regions. Instead of analyzing the traditional team statistics, we choose to look at team performance in clutch time. Unlike in other sports, the last few seconds of a basketball game can make a huge difference. Furthermore, compared with other sports, NBA players do not differ much in their performance in normal time. Clutch time, when every player is under pressure, is a true test of mental stability, stamina and skill. Differences in ability and performance are amplified in the final few seconds. Therefore, we believe that analyzing clutch time performance can give us great insight into the performance of the team.
4.2.1 3pcts vs 3fgm
#Plot on Top4 Last4
df_3pct['df_3fgm_overall']=df_3fgm$overall
df_pct3_v1 = df_3pct
df_pct3_v1_2 = df_pct3_v1[!is.na(df_3fgm$player_name),]
TopLowP_TSP = df_pct3_v1_2[df_pct3_v1_2$Team_Name %in% TopLowTeam,]
TopLowP_TSP['Rank'] = ifelse(TopLowP_TSP$Team_Name %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
p_3FGM3PCT = ggplot(TopLowP_TSP)+
geom_point(aes(df_3fgm_overall,overall,color = player_name,shape=Rank),size = 1)+
facet_wrap(~Team_Name)+
labs(title = "3pct_overall V.S 3fgm_overall Facet on Top4Last4",x = '3fgm_overall', y='3pct_overall')
ggplotly(p_3FGM3PCT)
This is a traditional method of analyzing the performance of a team. The 3-pointer is an important way to score in basketball and has a dominant effect on the final result of a game. As in our previous analysis, all top 4 teams have a very high 3-point rate. The rate is extremely high for the Spurs, which confirms our previous analysis.
###Team Average Overall fgm
##==================================================================
#Plot on All team
df_all$Team_Name.x = as.factor(df_all$Team_Name.x)
countorder = df_all %>% group_by(Team_Name.x) %>% summarize(av=mean(overall.x, na.rm=TRUE))
#df_all = merge(df_fgm,df_pct,by = "player_id",all=TRUE)
ggplot(countorder, aes(reorder(Team_Name.x,av),av)) +
geom_col(color = "tomato", fill = "orange", alpha = .2)+
coord_flip()+
theme_scientific()+
labs(title = "Team Average Overall fgm",x = 'Team', y='Average Overall fgm')
##==================================================================
#Plot on Top4 Last4
TopLowP_TSP_1 = df_all[df_all$Team_Name.y %in% TopLowTeam,]
countorder = TopLowP_TSP_1 %>% group_by(Team_Name.x) %>% summarize(av=mean(overall.x, na.rm=TRUE))
countorder['Rank'] = ifelse(countorder$Team_Name.x %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
#countorder
countorder
## # A tibble: 8 x 3
## Team_Name.x av Rank
## <fct> <dbl> <chr>
## 1 76ers 3.91 Down4
## 2 Cavaliers 4.23 Top4
## 3 Celtics 4.70 Top4
## 4 Lakers 4.56 Down4
## 5 Nets 3.35 Down4
## 6 Spurs 3.60 Top4
## 7 Suns 3.85 Down4
## 8 Warriors 4.00 Top4
ggplot(countorder, aes(reorder(Team_Name.x,av),av,fill = Rank)) +
geom_col()+
coord_flip()+
theme_scientific()+
labs(title = "Team Average Overall fgm",x = 'Team', y='Average Overall fgm')+
# Rank is mapped to fill, so the colorblind palette is applied to the fill scale
scale_fill_colorblind("Rank")
Team average overall fgm is a very important traditional measure of team performance. We can observe that strong teams do tend to have a higher fgm. The Spurs seem to be an outlier. However, if we combine this figure with our previous analysis of the Spurs' aggressiveness, their high 3-point ratio and their high success rate, the relatively low overall fgm can be easily understood. This is another example of how we can link various parts together to derive meaningful results.
### Coordinates plot
# average within group 3point
cbP = c("#999999", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")
df_3fgm_sum = aggregate(df_3fgm[,3:12], list(df_3fgm$Team_Name), sum, na.rm = TRUE)
deno = df_3fgm/df_3pct[,1:13]
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## Warning in Ops.factor(left, right): '/' not meaningful for factors
deno$player_name = df_3fgm$player_name
deno$player_id = df_3fgm$player_id
deno$Team_Name = df_3fgm$Team_Name
deno_modi = aggregate(deno[,3:12], list(deno$Team_Name), sum, na.rm = TRUE)
average3point = df_3fgm_sum/deno_modi
## Warning in Ops.factor(left, right): '/' not meaningful for factors
average3point$Group.1=deno_modi$Group.1
average3point[is.na(average3point)] = 0
TopLowTeam = c("Celtics","Cavaliers","Warriors","Spurs",
"Lakers","Suns","76ers","Nets")
TopLow3point = average3point[average3point$Group.1 %in% TopLowTeam,]
RK = ifelse(TopLow3point$Group.1 %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
TopLow3point['TRk']= RK
#TopLow3point
p1 = ggparcoord(data = TopLow3point,
columns =2:7,
mapping=aes(color=as.factor(Group.1),
linetype = as.factor(TRk)),
scale = 'globalminmax'
)+
scale_linetype_discrete("Rank")+
#scale_color_discrete("Team",
# labels=TopLow3point$Group.1)+
geom_vline(xintercept = 0:6, color = "lightblue")+
theme(axis.text.x=element_text(angle=90))+
labs(title = "Average 3PT Last Xmin yDown Top4 V.S Low4",x = 'Indicator', y='Team Average')+
scale_colour_colorblind("Team",
labels=TopLow3point$Group.1)
p2 = ggparcoord(data = TopLow3point,
columns =c(2,8:11),
mapping=aes(color=as.factor(Group.1),
linetype = as.factor(TRk)),
scale = 'globalminmax'
)+
scale_linetype_discrete("Rank")+
#scale_color_discrete("Team",
# labels=TopLow3point$Group.1)+
geom_vline(xintercept = 0:6, color = "lightblue")+
theme(axis.text.x=element_text(angle=90))+
labs(title = "Average 3PT Last Xmin yDownorHiger Top4 V.S Low4",x = 'Indicator', y='Team Average')+
scale_colour_colorblind("Team",
labels=TopLow3point$Group.1)
# average within group all point
cbP = c("#999999", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")
df_fgm_sum = aggregate(df_fgm[,3:12], list(df_fgm$Team_Name), sum, na.rm = TRUE)
deno = df_fgm/df_pct[,1:13]
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## Warning in Ops.factor(left, right): '/' not meaningful for factors
deno$player_name = df_fgm$player_name
deno$player_id = df_fgm$player_id
deno$Team_Name = df_fgm$Team_Name
deno_modi = aggregate(deno[,3:12], list(deno$Team_Name), sum, na.rm = TRUE)
averagepoint = df_fgm_sum/deno_modi
## Warning in Ops.factor(left, right): '/' not meaningful for factors
averagepoint$Group.1=deno_modi$Group.1
averagepoint[is.na(averagepoint)] = 0
TopLowTeam = c("Celtics","Cavaliers","Warriors","Spurs",
"Lakers","Suns","76ers","Nets")
TopLowpoint = averagepoint[averagepoint$Group.1 %in% TopLowTeam,]
RK = ifelse(TopLowpoint$Group.1 %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
TopLowpoint['TRk']= RK
#averagepoint
p3 = ggparcoord(data = TopLowpoint,
columns =2:7,
mapping=aes(color=as.factor(Group.1),
linetype = as.factor(TRk)),
scale = 'globalminmax'
)+
scale_linetype_discrete("Rank")+
#scale_color_discrete("Team",
# labels=TopLow3point$Group.1)+
geom_vline(xintercept = 0:6, color = "lightblue")+
theme(axis.text.x=element_text(angle=90))+
labs(title = "Average TotalPT Last Xmin yDown Top4 V.S Low4",x = 'Indicator', y='Team Average')+
scale_colour_colorblind("Team",
labels=TopLowpoint$Group.1)
p4 = ggparcoord(data = TopLowpoint,
columns =c(2,8:11),
mapping=aes(color=as.factor(Group.1),
linetype = as.factor(TRk)),
scale = 'globalminmax'
)+
scale_linetype_discrete("Rank")+
#scale_color_discrete("Team",
# labels=TopLow3point$Group.1)+
geom_vline(xintercept = 0:6, color = "lightblue")+
theme(axis.text.x=element_text(angle=90))+
labs(title = "Average TotalPT Last Xmin yDownorHiger Top4 V.S Low4",x = 'Indicator', y='Team Average')+
scale_colour_colorblind("Team",
labels=TopLowpoint$Group.1)
grid.arrange(p1, p2, p3, p4, nrow = 2)
From this coordinates plot, we can observe that the traditional performance measures in clutch time fail to give us a good indication of team strength. This did not meet our expectation: our original assumption was too naive and ignored why clutch time happens in the first place. When a strong team enters clutch time, it is usually because its major players are in bad shape that day. Otherwise, the team would have settled the game earlier. That is why clutch time statistics fail to give us a good indication.
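The team averages behind these plots are computed as total made shots divided by total attempts, with attempts recovered as made shots divided by shooting percentage. A minimal sketch of that idea on toy data follows. It restricts the division to numeric columns, which is one way to avoid the `Ops.factor` warnings shown above. The toy values, the single `overall` column and the assumption that percentages are stored as fractions are all made up for illustration.

```r
# Toy per-player data, made up for illustration only
made <- data.frame(Team_Name = c("A", "A", "B"), overall = c(4, 6, 5))
pct  <- data.frame(Team_Name = c("A", "A", "B"), overall = c(0.5, 0.4, 0.25))

# Divide numeric columns only; factor/character columns are left out
num_cols <- "overall"
attempts <- made
attempts[num_cols] <- made[num_cols] / pct[num_cols]

# Sum made shots and attempts per team, then take the ratio
made_sum     <- aggregate(made[num_cols],
                          list(Team = made$Team_Name), sum, na.rm = TRUE)
attempts_sum <- aggregate(attempts[num_cols],
                          list(Team = attempts$Team_Name), sum, na.rm = TRUE)
team_pct <- data.frame(Team    = made_sum$Team,
                       overall = made_sum$overall / attempts_sum$overall)
print(team_pct)
```

Summing before dividing weights each player by the number of shots taken, so a player with very few attempts does not distort the team figure.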
4.2.2 Further analysis on 30s clutch time
##==================================================================
#Plot on ALL
df_pct['df_fgm_overall']=df_fgm$overall
df_pct_v1 = df_pct
df_pct_v1_2 = df_pct_v1[!is.na(df_fgm$player_name),]
p_FGMPCT = ggplot(df_pct_v1_2)+
geom_point(aes(df_fgm_overall,overall,color = player_name),size = 1)+
facet_wrap(~Team_Name)+
labs(title = "pct_overall V.S fgm_overall ",x = 'fgm_overall', y='pct_overall')
ggplotly(p_FGMPCT)
df_pct['df_fgm_overall']=df_fgm$X30sec_plusminus_5
df_pct_v1 = df_pct
df_pct_v1_2 = df_pct_v1[!is.na(df_fgm$player_name),]
TopLowP_TSP = df_pct_v1_2[df_pct_v1_2$Team_Name %in% TopLowTeam,]
TopLowP_TSP['Rank'] = ifelse(TopLowP_TSP$Team_Name %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
p_FGMPCT = ggplot(TopLowP_TSP)+
geom_point(aes(df_fgm_overall,X30sec_plusminus_5,color = player_name,shape=Rank),size = 2)+
facet_wrap(~Team_Name)+
labs(title = "pct_X30sec_plusminus_5 V.S fgm_X30sec_plusminus_5 Facet on Top4Last4",x = 'fgm_X30sec_plusminus_5', y='pct_X30sec_plusminus_5')
ggplotly(p_FGMPCT)
In this plot, we take a deeper look at the final 30 seconds when the score is tight. This situation differs from the one above because, in the last 30 seconds with the score within 5 points (the X30sec_plusminus_5 column), anything can happen. This is the real clutch time. As before, the common belief is that the ball should go to the best players. The interesting case is the Warriors, the champions of last season: their two best players, Kevin Durant and Stephen Curry, both have very low pct and fgm compared to their normal statistics. This is consistent with our previous analysis that, when a strong team enters clutch time, the star players are usually not performing well that day. However, Shaun Livingston, a player with more than 10 years of NBA experience, seems more productive in the last 30 seconds of clutch time. The same pattern can be found in the other top 4 teams, where veterans such as Al Horford of the Celtics and Tony Parker of the Spurs usually perform better. Even though they are no longer among the best players on their teams, they can be the best in clutch time. Advice for coaches: give the ball to veterans, and adjust your strategy based on the actual performance of the players on that day.
4.2.3 3pts average 10second down figure plot
##==================================================================
#Plot on All Teams
averagepoint=averagepoint[2:31,]
averagepoint['abbr'] = df_name_team_abbr[,1]
average3point=average3point[2:31,]
average3point['abbr'] = df_name_team_abbr[,1]
path = 'https://github.com/NiHaozheng/NBA-Visualization/blob/master/clutch_team/logo/'
averagepoint$img = paste(path,averagepoint$abbr,'.png?raw=true',sep='')
average3point$img = paste(path,average3point$abbr,'.png?raw=true',sep='')
##==================================================================
#Plot on Top4 Last4
TopLowP_TSP_1 = averagepoint[averagepoint$Group.1 %in% TopLowTeam,]
TopLowP_TSP_2 = average3point[average3point$Group.1 %in% TopLowTeam,]
# TopLowP_TSP_1 comes from averagepoint (all field goals)
p3 = ggplot(TopLowP_TSP_1,aes(overall,X10sec_down_3))+
geom_point()+
geom_image(image = TopLowP_TSP_1$img,
size = .05)+
theme_scientific()+
labs(title = "Total Average X10sec_down_3 v.s. Overall TopDown4",x = 'Overall', y='X10sec_down_3')
# TopLowP_TSP_2 comes from average3point (3-pointers only)
p4 = ggplot(TopLowP_TSP_2,aes(overall,X10sec_down_3))+
geom_point()+
geom_image(image = TopLowP_TSP_2$img,
size = .05)+
theme_scientific()+
labs(title = "3pt Average X10sec_down_3 v.s. Overall TopDown4",x = 'Overall', y='X10sec_down_3')
grid.arrange(p3, p4, nrow = 1)
Although the traditional method in general fails to give us the result we are looking for, the 3pt average performance in the last 10 seconds is highly correlated with the ranking of the team. This figure plot gives us a clear visual representation of the data. One potential reason is that strong teams usually have a deeper player pool, with 3-point shooters designated for the final shot. This is why strong teams in general have a better last-10-second performance (even though the star players may not be in good shape, as explained above).
4.3 Player specific analysis
For individual players, we mainly cover shooting patterns and miss rates. These are covered in detail in our interactive components.
4.4 Miscellaneous plots without significant discoveries
During our analysis, we looked at a large number of plots and explored many different aspects. However, we could not obtain meaningful patterns from some of them. We include them in this section simply to document the path we have taken.
### TSP VS PTS All Star
# Define FGA: Field Goal Attempt
FGA = df_fgm$overall / df_fct$overall
# Define TSP: True shooting percent
TSP = df_pts$overall/(2*(FGA+0.44*df_fta$overall))
df_pts['TSP'] = TSP
# Make a copy of df_pts
df_pts_v1 = df_pts
# Subset to remove all the NAs due to players that did not have a team or did not play in 2016
df_pts_v1_2 = df_pts_v1[!is.na(df_pts_v1$TSP),]
##==================================================================
#Plot on whole data, all teams
p_TSP_All = ggplot(df_pts_v1_2)+
geom_point(aes(overall,TSP,color = player_name,shape = Team_Name),size = 2)+
labs(title = "TSP V.S PTS All Star",x = 'Overall PTS', y='Overall TSP')
ggplotly(p_TSP_All)
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have
## 31. Consider specifying shapes manually if you must have them.
### TSP VS PTS on X5min_plusminus_5
# Define FGA: Field Goal Attempt on X5min_plusminus_5
FGA = df_fgm$X5min_plusminus_5 / df_fct$X5min_plusminus_5
# Define TSP: True shooting percent
TSP = df_pts$X5min_plusminus_5/(2*(FGA+0.44*df_fta$X5min_plusminus_5))
df_pts['TSP'] = TSP
# Make a copy of df_pts
df_pts_v1 = df_pts
# Subset to remove all the NAs due to players that did not have a team or did not play in 2016
df_pts_v1_2 = df_pts_v1[!is.na(df_pts_v1$TSP),]
p_TSP_All = ggplot(df_pts_v1_2)+
geom_point(aes(X5min_plusminus_5,TSP,color = player_name,shape = Team_Name),size = 2)+
labs(title = "TSP V.S PTS All Star",x = 'X5min_plusminus_5 PTS', y='X5min_plusminus_5 TSP')
ggplotly(p_TSP_All)
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have
## 31. Consider specifying shapes manually if you must have them.
### 3pcts_overall VS 3fgm_overall
##==================================================================
#Plot on All Team
df_3pct['df_3fgm_overall']=df_3fgm$overall
df_pct3_v1 = df_3pct
df_pct3_v1_2 = df_pct3_v1[!is.na(df_3fgm$player_name),]
p_3FGM3PCT_All = ggplot(df_pct3_v1_2)+
geom_point(aes(df_3fgm_overall,overall,color = player_name,shape = Team_Name),size = 2)+
labs(title = "3pct_overall V.S 3fgm_overall ",x = '3fgm_overall', y='3pct_overall')
ggplotly(p_3FGM3PCT_All)
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have
## 31. Consider specifying shapes manually if you must have them.
### ftm_30sec_plusmiuns_5
##==================================================================
#Plot on All teams
df_fta['df_ftm_30sec_plusmiuns_5'] = df_ftm$X30sec_plusminus_5
df_fta_v1 = df_fta
df_fta_v1_2 = df_fta_v1[!is.na(df_fta$player_name),]
p_fta_ftm = ggplot(df_fta_v1_2)+
geom_point(aes(X30sec_plusminus_5,
df_ftm_30sec_plusmiuns_5,
color = player_name,
shape=Team_Name),
size = 1.3,
alpha=0.5,
position = "jitter")+
labs(title = "df_ftm_30sec_plusmiuns_5 V.S X30sec_plusminus_5 ",x = 'X30sec_plusminus_5', y='df_ftm_30sec_plusmiuns_5')
ggplotly(p_fta_ftm)
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have
## 31. Consider specifying shapes manually if you must have them.
### 1min_down5 plot
#Plot on Top4 Last4
TopLowP_TSP_1 = df_pct[df_pct$Team_Name %in% TopLowTeam,]
ggplot()+
geom_point(data =TopLowP_TSP_1,
aes(x = X1min_down_5, y= overall),
position = position_jitter(w = 0.01, h = 0.02),
alpha = 0.5,
size = 3)+
facet_wrap(~Team_Name)+
labs(title = "overall V.S X1min_down_5",
x = 'X1min_down_5',
y='overall')
4.4.1 pair plots
pairs(df_all[c("X10sec_down_3.x","X10sec_down_3.y","X30sec_down_3.x","X30sec_down_3.y")])
#df_all
pairs(df_all[c("X1min_down_5.x","X1min_down_5.y",
"X3min._down_5.x","X3min._down_5.y",
"X5min._down_5.x","X5min._down_5.y")])
#df_all
pairs(df_all[c("X30sec_plusminus_5.x","X30sec_plusminus_5.y",
"X1min_plusminus_5.x","X1min_plusminus_5.y",
"X3min_plusminus_5.x","X3min_plusminus_5.y")])